RNA-seq : Hisat2+Stringtie+DESeq2

Original JunJunLab 老俊俊的生信笔记 2022-08-15

收录于合集 #RNA-seq 39个

RNA-seq : Hisat2+Stringtie+DESeq2

RNA-seq 即转录组测序技术，就是用高通量测序技术进行测序分析，反映出 mRNA，smallRNA，noncodingRNA 等或者其中一些的表达水平，寻找表达差异的基因预测或验证相关的分子机制及功能。

2016 年发表在 nature protocols 上一篇关于转录本精确定量^[1]的文章：

文章中以 HISAT + Stringtie + Ballgrown 的 pipeline 对 RNA-seq 进行转录本的组装和精确定量，但是后来 2017 年有一篇nature communications^[2]的文章提示，下游 Ballgrown 软件并不是能很好的做差异分析，DESeq2 差异软件会有较好的效果。

HISAT 软件是 Tophat 的升级版，具有更快的运行速度，更高的准确性，需要更低的内存来运行，目前已经更新到2.2.1 版本^[3]了。

基本流程（可选）：
1.比对 > 转录本组装 > 定量
2.比对 > 定量
如果需要鉴定分析新的转录本并定量就选第一个流程

开始分析

目录结构：

1.RNA-seq 数据获得方式：

1、文章中的 GSE 编号到GEO^[4]数据库下载
2、ENA^[5]数据框搜索下载
3、sra-exploer^[6]下载
4、公司测序的自己课题数据
sra-exploer 下载可获得 ftp，http，aspera 等下载地址，还可以自动命好名，比较方便

我的测试文件是双端测序，每个样本两个生物学重复。

2.准备环境

# 创建分析环境（前提是已经安装好conda！，不会的自行百度）
conda create -n rnaseq python=2
# 进入环境
conda activate rnaseq
# 安装所需软件，耐心等待
conda install -y fastqc multiqc hisat2 stringtie samtools

3.质量检查

# 新建目录，把fastq文件放到1.fastq-data文件夹下
$ ls 1.fastq-data/
test1_R1_rep1.fq.gz  test1_R2_rep1.fq.gz  test2_R1_rep1.fq.gz  test2_R2_rep1.fq.gz
test1_R1_rep2.fq.gz  test1_R2_rep2.fq.gz  test2_R1_rep2.fq.gz  test2_R2_rep2.fq.gz
# 新建qc文件夹
$ mkdir 0.qc
# 质检，-t使用线程数，-o输出目录，最后是输入文件
$ fastqc -t 10 -o 0.qc/ 1.fastq-data/*.gz
Started analysis of test1_R1_rep1.fq.gz
Started analysis of test1_R2_rep1.fq.gz
Started analysis of test2_R1_rep1.fq.gz
Started analysis of test2_R2_rep1.fq.gz
...

# 质检结束后，整合所有质检结果
$ multiqc -o 0.qc/ 0.qc/*

质检完成后，汇总 0.qc 文件夹下产生 html 结尾的文件，双击在网页中打开，具体每个图的解释参考这篇文章：FastQC 数据质控报告的详细解读^[7]
如果含有接头序列则需要去除接头序列，去接头序列软件有很多，cutapdat、trim-galore、fastp、trimmomatic、fastx-tookit 等，使用方法可参照软件参考文档

3.1 质检结果

可以看到测序数据质量都是很高的，也没有接头序列污染。

4.比对前准备参考基因组和注释文件

比对需要建立索引，这样才能方便比对软件进行快速的比对，我们可以手动建立索引或者直接去官网下载已经建好的索引^[8]，官网有 6 个物种的索引。自己建索引需要较高的内存和较长的时间，不推荐。
人的索引建立需要至少 160G 内存和 2 个小时。

4.1 下载构建好的索引

>>
__snp 加入了单核苷酸多态性的信息
__tran 加入了转录本的信息

# 建立索引存放文件夹
$ mkdir 2.index
$ cd 2.index
# 下载，文件3.66个G，下载速度慢可以用迅雷等加速软件
$ wget https://cloud.biohpc.swmed.edu/index.php/s/grcm38_tran/download
$ ls
total 3840560
-rwxrwxrwx 1 root root 3932732963 Jun 11 16:24 grcm38_tran.tar.gz
# 解压
$ tar -zxvf grcm38_tran.tar.gz
grcm38_tran/
grcm38_tran/genome_tran.2.ht2
grcm38_tran/genome_tran.5.ht2
grcm38_tran/genome_tran.4.ht2
grcm38_tran/genome_tran.3.ht2
grcm38_tran/genome_tran.7.ht2
grcm38_tran/genome_tran.6.ht2
grcm38_tran/genome_tran.1.ht2
grcm38_tran/genome_tran.8.ht2
grcm38_tran/make_grcm38_tran.sh
# 查看解压的文件有grcm38_tran文件夹，里面有哪些文件？
$ cd grcm38_tran/ && ls -l
total 5017056
-rwxrwxrwx 1 root root 1639574274 Mar 18  2016 genome_tran.1.ht2
-rwxrwxrwx 1 root root  664305504 Mar 18  2016 genome_tran.2.ht2
-rwxrwxrwx 1 root root       6119 Mar 18  2016 genome_tran.3.ht2
-rwxrwxrwx 1 root root  663195875 Mar 18  2016 genome_tran.4.ht2
-rwxrwxrwx 1 root root 1482328835 Mar 18  2016 genome_tran.5.ht2
-rwxrwxrwx 1 root root  675618574 Mar 18  2016 genome_tran.6.ht2
-rwxrwxrwx 1 root root   10349748 Mar 18  2016 genome_tran.7.ht2
-rwxrwxrwx 1 root root    2070619 Mar 18  2016 genome_tran.8.ht2
-rwxrwxrwx 1 root root       2480 Mar 18  2016 make_grcm38_tran.sh

4.2 自己构建索引

下载 GTF 注释文件方便后面对基因或者转录本进行定量，genome 和 gtf 文件可以从 UCSC、ensembl、gencode 三大数据库获得，我们去ensembl^[9]下载基因组和注释文件
点击下面的小鼠：

下载完后建立索引：

# 提取剪接位点和外显子信息
$ extract_splice_sites.py Mus_musculus.GRCm39.104.gtf > Mus_musculus.ss
$ extract_exons.py Mus_musculus.GRCm39.104.gtf > Mus_musculus.exon
# 建立索引
$ hisat2-build --ss Mus_musculus.ss \
               --exon Mus_musculus.exon \
               Mus_musculus.GRCm39.dna.primary_assembly.fa \
               Mus_musculus.GRCm39_tran
# 接下来是漫长的等待过程

5.比对到参考基因组
hisat2 参数用法可参考：转录组比对软件 HISAT2 的使用说明^[10]

# 建立比对结果文件夹
$ mkdir 3.map-data
# hisat2 基本用法
$ hisat2
HISAT2 version 2.2.1 by Daehwan Kim (infphilo@gmail.com, www.ccb.jhu.edu/people/infphilo)
Usage:
  hisat2 [options]* -x <ht2-idx> {-1 <m1> -2 <m2> | -U <r>} [-S <sam>]
# -x跟索引名前缀，-1，-2跟双端测序文件，-U跟单端测序文件，-S输出为sam格式的文件,-p线程数量
# 我们直接输出为排序好的bam文件
# 单个样本比对，--dta输出为转录本组装的reads，-@为samtools的线程数，--summary-file输出比对信息
$ hisat2 -p 10 --dta -x 2.index/grcm38_tran/genome_tran \
         --summary-file 3.map-data/test1_rep1_summary.txt \
         -1 1.fastq-data/test1_R1_rep1.fq.gz \
         -2 1.fastq-data/test1_R2_rep1.fq.gz \
         | samtools sort -@ 10 -o 3.map-data/test1_rep1.sorted.bam
# 然后比对四次就可以了

批量比对：

# 创建一个脚本文件然后输入批量比对的脚本
$ vi map.sh
#!/bin/bash
# batch map to genome
for i in test1 test2
do
    for j in rep1 rep2
    do
        hisat2 -p 10 --dta -x 2.index/grcm38_tran/genome_tran \
               --summary-file 3.map-data/${i}_${j}_summary.txt \
               -1 1.fastq-data/${i}_R1_${j}.fq.gz \
               -2 1.fastq-data/${i}_R2_${j}.fq.gz \
               | samtools sort -@ 10 -o 3.map-data/${i}_${j}.sorted.bam
    done
done
# 保存后挂后台运行
$ nohup ./map.sh &

查看后台命令运行情况：

$ htop

可以看到 12 个线程都在跑，用了 10 多个 G 的运行内存

比对结果解读，随便打开一个 test1_rep1_summary.txt 查看：

$ less 3.map-data/test1_rep1_summary.txt
20555443 reads; of these:
  20555443 (100.00%) were paired; of these:
    1680108 (8.17%) aligned concordantly 0 times
    17649450 (85.86%) aligned concordantly exactly 1 time
    1225885 (5.96%) aligned concordantly >1 times
    ----
    1680108 pairs aligned concordantly 0 times; of these:
      203713 (12.12%) aligned discordantly 1 time
    ----
    1476395 pairs aligned 0 times concordantly or discordantly; of these:
      2952790 mates make up the pairs; of these:
        1465464 (49.63%) aligned 0 times
        1287384 (43.60%) aligned exactly 1 time
        199942 (6.77%) aligned >1 times
96.44% overall alignment rate

可以看到总共有 20555443 条 reads，8.17%没有比对上，85.86%能够唯一比对上，5.96%能比对到多个位置，比对率还是很高的。

6.转录本组装和合并

# 建立组装文件文件夹
$ mkdir 4.assembl-data
# 下载gtf文件
$ wget http://ftp.ensembl.org/pub/release-102/gtf/mus_musculus/Mus_musculus.GRCm38.102.gtf.gz
# 解压gtf文件
$ gunzip Mus_musculus.GRCm38.102.gtf.gz
# 组装转录本，-p为线程数，-G为组装参考注释文件，-l为输出文件名前缀
# 单个样本运行
$ stringtie -p 10 -G Mus_musculus.GRCm38.102.gtf \
            -l test1_rep1 \
            -o 4.assembl-data/test1_rep1.gtf \
            3.map-data/test1_rep1.sorted.bam

批量组装：

# 创建一个脚本文件然后输入批量组装的脚本
$ vi assembl.sh
#!/bin/bash
# assembl
for i in test1 test2
do
        for j in rep1 rep2
        do
                stringtie -p 10 -G ./Mus_musculus.GRCm38.102.gtf \
                          -l ${i}_${j} \
                          -o 4.assembl-data/${i}_${j}.gtf \
                          3.map-data/${i}_${j}.sorted.bam
        done
done
# 后台运行
$ nohup ./assembl.sh &

6.1 合并转录本
建立一个 gtf 的 list 文件，里面为上一步输出的文件的路径

# 创建 mergelist 文件
$ ls -l 4.assembl-data/*.gtf | awk '{print $9}' > 4.assembl-data/mergelist.txt
# 查看一下
$ head -3 4.assembl-data/mergelist.txt
4.assembl-data/test1_rep1.gtf
4.assembl-data/test1_rep2.gtf
4.assembl-data/test2_rep1.gtf
# 合并gtf文件
$ stringtie --merge -p 10 -G ./Mus_musculus.GRCm38.102.gtf \
                    -o stringtie_merged.gtf \
                    4.assembl-data/mergelist.txt

7.定量

# 建立定量文件夹
$ mkdir 5.quantity-data
$ cd 5.quantity-data && mkdir test1_rep1 test1_rep2 test2_rep1 test2_rep2 && cd ../
# 定量，单个样本，-e评估转录本表达丰度，-A评估基因表达丰度并输出，-G跟合并的gtf文件
$ stringtie  -p 10 -e -G ./stringtie_merged.gtf \
             -o 5.quantity-data/test1_rep1/test1_rep1.gtf \
             -A 5.quantity-data/test1_rep1/gene_abundances.tsv \
             3.map-data/test1_rep1.sorted.bam

批量定量：

# 创建一个脚本文件然后输入批量定量的脚本
$ vi quantity.sh
#!/bin/bash
# quantity
for i in test1 test2
do
       for j in rep1 rep2
       do
          stringtie  -p 10 -e -G ./stringtie_merged.gtf \
                     -o 5.quantity-data/${i}_${j}/${i}_${j}.gtf \
                     -A 5.quantity-data/${i}_${j}/gene_abundances.tsv \
                     3.map-data/${i}_${j}.sorted.bam
       done
done

最后会在相应文件夹下生成样本名.gtf和gene_abundances.tsv的两个文件，对应每个样本的 count 值定量结果，我们需要合并到一个文件里。

需要使用一个 prepDE.py 脚本，在 python2 环境中使用，下载地址如下：
prepDE.py (python2) : http://ccb.jhu.edu/software/stringtie/dl/prepDE.py
prepDE.py (python3) : http://ccb.jhu.edu/software/stringtie/dl/prepDE.py3

prepDE.py 需要一个 sample_list，第一列为样本名，第二列为 gtf 文件路径

$ head sample_list.txt
test1_rep1 ./5.quantity-data/test1_rep1/test1_rep1.gtf
test1_rep2 ./5.quantity-data/test1_rep2/test1_rep2.gtf
test2_rep1 ./5.quantity-data/test2_rep1/test2_rep1.gtf
test2_rep2 ./5.quantity-data/test2_rep2/test2_rep2.gtf
# 提取合并count结果，-i为输入sample_list
$  ./prepDE.py -i sample_list.txt

结束后在当前目录生成gene_count_matrix.csv和transcript_count_matrix.csv文件8.提取 FPKM/TPM 或 coverage 结果

获取提取脚本：https://github.com/griffithlab/rnaseq_tutorial/blob/master/scripts/stringtie_expression_matrix.pl

# 查看用法，--result_dirs跟上含有每个样本gtf的文件夹名称
$ ./stringtie_expression_matrix.pl
Required parameters missing
Usage:  ./stringtie_expression_matrix.pl --expression_metric=TPM  --result_dirs='HBR_Rep1,HBR_Rep2,HBR_Rep3,UHR_Rep1,UHR_Rep2,UHR_Rep3' --transcript_matrix_file=transcript_tpms_all_samples.tsv --gene_matrix_file=gene_tpms_all_samples.tsv
$ cd 5.quantity-data/
# 提取TPM
$ ./stringtie_expression_matrix.pl --expression_metric=TPM \
                                   --result_dirs='test1_rep1,test1_rep2,test2_rep1,test2_rep2' \
                                   --transcript_matrix_file=transcript_tpms_all_samples.tsv \
                                   --gene_matrix_file=gene_tpms_all_samples.tsv

# 提取FPKM
./stringtie_expression_matrix.pl --expression_metric=FPKM \
                                   --result_dirs='test1_rep1,test1_rep2,test2_rep1,test2_rep2' \
                                   --transcript_matrix_file=transcript_fpkms_all_samples.tsv \
                                   --gene_matrix_file=gene_fpkms_all_samples.tsv

# 提取coverage
./stringtie_expression_matrix.pl --expression_metric=coverage \
                                   --result_dirs='test1_rep1,test1_rep2,test2_rep1,test2_rep2' \
                                   --transcript_matrix_file=transcript_coverage_all_samples.tsv \
                                   --gene_matrix_file=gene_coverage_all_samples.tsv
# 在当前目录就会生成相应的基因和转录本的tpm、fpkm、coverage 结果

后续的差异分析和可视化等都可以基于 count 和 tpm 等文件操作了。

如果不需要发现新的转录本和基因的话，可直接基于 bam 文件走定量步骤

1、HTseq 定量（使用参考：使用 htseq-count 进行定量分析^[11]）
2、Subread 包中的 featureCounts 定量（使用参考：转录组定量-featureCounts^[12]）
3、stringtie 定量

[这里直接使用 stringtie 定量]
单样本定量：

# 建立定量文件夹
$ mkdir 5.quantity-data
$ cd 5.quantity-data && mkdir test1_rep1 test1_rep2 test2_rep1 test2_rep2 && cd ../
# 定量，单个样本，-e评估转录本表达丰度，-A评估基因表达丰度并输出，-G跟合并的gtf文件
$ stringtie  -p 10 -e -G ./Mus_musculus.GRCm38.102.gtf \
             -o 5.quantity-data/test1_rep1/test1_rep1.gtf \
             -A 5.quantity-data/test1_rep1/gene_abundances.tsv \
             3.map-data/test1_rep1.sorted.bam

批量定量：

# 创建一个脚本文件然后输入批量定量的脚本
$ vi quantity.sh
#!/bin/bash
# quantity
for i in test1 test2
do
        for j in rep1 rep2
        do
            stringtie  -p 10 -e -G ./Mus_musculus.GRCm38.102.gtf \
                        -o 5.quantity-data/${i}_${j}/${i}_${j}.gtf \
                        -A 5.quantity-data/${i}_${j}/gene_abundances.tsv \
                        3.map-data/${i}_${j}.sorted.bam
        done
done

后续直接走提取 count 和 tpm/fpkm 等结果

DESeq2 差异分析

# 安装DESeq2包
BiocManager::install('DESeq2')
# 加载包
library(DESeq2)
# 设置工作路径
setwd('D:\\rnaseq')
# 读入counts矩阵
gene_count_matrix <- read.csv("D:/rnaseq/gene_count_matrix.csv",row.names = 1)
count <- gene_count_matrix[rowSums(gene_count_matrix)>0,]
# 构建表型矩阵
colData <- data.frame(row.names = colnames(count),
                      condition = factor(c(rep('control',2),rep('treat',2)),
                                           levels=c('control','treat'))
                      )
# 查看
colData
#            condition
# test1_rep1   control
# test1_rep2   control
# test2_rep1     treat
# test2_rep2     treat

dds <- DESeqDataSetFromMatrix(countData = count, colData = colData,design = ~ condition)
dds <- DESeq(dds)
res <- results(dds)
diff_res <- as.data.frame(res)
diff_res$gene_name <- rownames(diff_res)
# 输出差异结果
write.table(diff_res,file = 'DESeq2_diff_results.csv',quote = F,sep = ',',row.names = F,col.names = T)

然后我们用我之前写的在线 shiny 火山图 APP在线画个火山图看看：

详情参考链接：

不错不错！

参考资料

[1]

转录本精确定量: https://www.nature.com/articles/nprot.2016.095

[2]

nature communications: https://www.nature.com/articles/s41467-017-00050-4

[3]

2.2.1版本: http://daehwankimlab.github.io/hisat2/

[4]

GEO: https://www.ncbi.nlm.nih.gov/gds/

[5]

ENA: https://www.ebi.ac.uk/ena/browser/home

[6]

sra-exploer: https://sra-explorer.info/#

[7]

FastQC 数据质控报告的详细解读: https://www.jianshu.com/p/dc6820eb342e?utm_campaign=maleskine&utm_content=note&utm_medium=seo_notes&utm_source=recommendation

[8]

官网下载已经建好的索引: http://daehwankimlab.github.io/hisat2/download/

[9]

ensembl: https://asia.ensembl.org/index.html

[10]

转录组比对软件HISAT2的使用说明: https://www.omicsclass.com/article/285

[11]

使用htseq-count进行定量分析: https://blog.csdn.net/weixin_43569478/article/details/108079249

[12]

转录组定量-featureCounts: https://www.bioinfo-scrounger.com/archives/407/

欢迎小伙伴留言评论！

今天的分享就到这里了，敬请期待下一篇！

最后欢迎大家分享转发，您的点赞是对我的鼓励和肯定！

如果觉得对您帮助很大，打赏一下吧！

观察｜官方通报陕西蒲城一职校学生坠亡：事发前与舍友发生口角和肢体冲突认定该生系高空坠落死亡

桐城一派｜倒在“跨年夜”的龚书记，13个字换来免职调查冤不冤？

比佟丽娅还恋爱脑，怀孕7次流产4次，目睹丈夫背叛却选择原谅

市管干部“龚书记”免职迷局

讣告！又一知名女星在家中去世，终年54岁，曾是无数人白月光…

RNA-seq : Hisat2+Stringtie+DESeq2